Coarse grain parallel codes for solving sparse systems of linear algebraic equations can be developed in several different ways. The following procedure is suitable for some parallel computers. A preliminary reordering of the matrix is first applied to move as many zero elements as possible to the lower left corner. After that the matrix is partitioned into large blocks and the blocks in the lower left corner contain only zero elements. An attempt to obtain a good load-balance is carried out by allowing the diagonal blocks to be rectangular.
While the algorithm based on the above ideas has good parallel properties, some stability problems may arise during the factorization because the pivotal search is restricted to the diagonal blocks. A simple a priori procedure has been used in a previous version in an attempt to stabilize the algorithm. In this paper it is shown that three enhanced stability devices can successfully be incorporated in the algorithm so that it is further stabilized and, moreover, the parallel properties of the original algorithm are preserved.
The first device is based on a dynamic check of the stability. In the second device a slightly modified reordering is used in an attempt to get more nonzero elements in the diagonal blocks (the number of candidates for pivots tends to increase in this situation and, therefore, there is a better chance to select more stable pivots). The third device applies a P5-like ordering as a secondary criterion in the basic reordering procedure. This tends to improve the reordering and the performance of the solver. Moreover, the device is stable, while the original P5 ordering is often unstable.
Numerical results obtained by using the three new devices are presented. The well-known sparse matrices from the Harwell-Boeing set are used in the experiments. 相似文献
Three currently available concurrent language systems, Pascal-Plus, occam and Edison, are used to implement a controller for a robot arm. The robot arm allows real parallelism of operation within the movements of the arm. The feasibility and restrictions placed upon the resultant solution for each of the language systems is then analysed and discussed. A Petri-net solution is also presented for the generalized problem and it is shown that each of the solutions is a different folding of the general net. 相似文献
The development and implementation of systems for the more complex realtime image processing and scene understanding tasks, such as robot vision and remote surveillance, calls for faster computation than that possible using the traditional serial computer. The advent of VLSI has made feasible the consideration of more specialized processing architectures, designed to support these datarates, while keeping systems compact and relatively cheap. Two approaches are discussed: the use of a programmable processor array, and the customizing of image processing algorithms in silicon. This paper examines designs based upon each approach in the light of the techniques and constraints of VLSI. In particular we describe in some detail an example of a VLSI parallel array processor, the Grid (GEC rectangular image and data processor), and a number of special-purpose CMOS/SOS chips based on systolic design techniques. 相似文献
This paper describes two classifier systems that learn. These are rule-based systems that use genetic algorithms, which are based on an analogy with natural selection and genetics, as their principal learning mechanism, and an economic model as their principal mechanism for apportioning credit. CFS-C is a domain-independent learning system that has been widely tested on serial computers. CFS is a parallel implementation of CFS-C that makes full use of the inherent parallelism of classifier systems and genetic algorithms, and that allows the exploration of large-scale tasks that were formerly impractical. As with other approaches to learning, classifier systems in their current form work well for moderately-sized tasks but break down for larger tasks. In order to shed light on this issue, we present several empirical studies of known issues in classifier systems, including the effects of population size, the actual contribution of genetic algorithms, the use of rule chaining in solving higher-order tasks, and issues of task representation and dynamic population convergence. We conclude with a discussion of some major unresolved issues in learning classifier systems and some possible approaches to making them more effective on complex tasks. 相似文献
Parallel computers are having a profound impact on computational science. Recently highly parallel machines have taken the lead as the fastest supercomputers, a trend that is likely to accelerate in the future. We describe some of these new computers, and issues involved in using them. We present elliptic PDE solutions currently running at 3.8 gigaflops, and an atmospheric dynamics model running at 1.7 gigaflops, on a 65 536-processor computer.
One intrinsic disadvantage of a parallel machine is the need to perform inter-processor communication. It is important to ensure that such communication time is maintained at a small fraction of computation time. We analyze standard multigrid algorithms in two and three dimensions from this point of view, indicating that performance efficiencies in excess of 95% are attainable under suitable conditions on moderately parallel machines. We also demonstrate that such performance is not attainable for multigrid on massively parallel computers, as indicated by an example of poor multigrid efficiency on 65 536 processors. The fundamental difficulty is the inability to keep 65 536 processors busy when operating on very coarse grids.
Most algorithms used for implementing applications on parallel machines have been derived directly from algorithms designed for serial machines. The previously mentioned multigrid example indicates that such ‘parallelized’ algorithms may not always be optimal. Parallel machines open the possibility of finding totally new approaches to solving standard tasks—intrinsically parallel algorithms. In particular, we present a class of superconvergent multiple scale methods that were motivated directly by massevely parallel machines. These methods differ from standard multigrid methods in an intrinsic way, and allow all processors to be used at all times, even when processing on the coarsest grid levels. Their serial versions are not sensible algorithms. The idea that parallel hardware—the Connection Machine in this case—can lead to discovery of new mathematical algorithms was surprising for us. 相似文献
Prostate cancer accounts for one-third of noncutaneous cancers diagnosed in US men and is a leading cause of cancer-related
death. Advances in Fourier transform infrared spectroscopic imaging now provide very large data sets describing both the structural
and local chemical properties of cells within prostate tissue. Uniting spectroscopic imaging data and computer-aided diagnoses
(CADx), our long term goal is to provide a new approach to pathology by automating the recognition of cancer in complex tissue.
The first step toward the creation of such CADx tools requires mechanisms for automatically learning to classify tissue types—a
key step on the diagnosis process. Here we demonstrate that genetics-based machine learning (GBML) can be used to approach
such a problem. However, to efficiently analyze this problem there is a need to develop efficient and scalable GBML implementations
that are able to process very large data sets. In this paper, we propose and validate an efficient GBML technique——based on an incremental genetics-based rule learner. exploits massive parallelisms via the message passing interface (MPI) and efficient rule-matching using hardware-implemented
operations. Results demonstrate that is capable of performing prostate tissue classification efficiently, making a compelling case for using GBML implementations
as efficient and powerful tools for biomedical image processing. 相似文献
Single-Instruction Multiple-Data (SIMD) instructions provide an inexpensive way to exploit the Data-Level Parallelism in multimedia applications. However, the performance improvement obtained by employing SIMD instructions is often limited because frequently many overhead instructions are required to bring data in a form amenable to SIMD processing. In this paper, we employ two techniques to overcome this limitation. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids packing/unpacking conversion overhead. The second technique, Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many two-dimensional multimedia algorithms such as the (I) Discrete Cosine Transform, 2 × 2 Haar Transform, and pixel padding. In addition, we propose a few new media instructions. Experimental results obtained by extending the SimpleScalar toolset show that these techniques improve performance by up to a factor of 4.5 compared to a conventional SIMD instruction set extension. 相似文献
The systolic processing offers the possibility of solving a large number of standard problems on multicellular computing devices
with autonomous cells (Processing Elements—PEs). The resulting systolic arrays exploit the underlying parallelism of many
computationally intensive problems and offer a vital and effective way of handling them. Advances in technology and especially
in VLSI and FPGA have an ongoing contribution to the evolution of systolic arrays. Herein, a FPGA-based Systolic array prototype
implementing the Factorization stage of the Quadrant Interlocking Factorization—QIF (Butterfly) method is presented and the
corresponding time-complexities achieved are discussed. 相似文献
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced. 相似文献